A 90nm 1GHz 22mW 16x16-bit 2’s Complement Multiplier for Wireless Baseband
Abstract

This paper describes a static 16x16-bit 2’s complement wireless baseband multiplier testchip in 1.2V, 90nm dual-Vt CMOS technology. One-hot Booth encoding, a sum/carry output delay-difference optimized 3:2 compressor tree, and signal-profile optimized final adder schemes are employed to achieve 1GHz, 22mW operation at 1.2V, scalable to 500MHz, 3mW at 0.8V.

Introduction

Short bit-width (≤16-bit) radix-2 2’s complement multipliers are performance and power-critical components for wireless baseband signal processing. Clusters of parallel multiplier/multiply-add units are required for complex filter operations with single-cycle latency and throughput demand, while consuming low switching energy and active/standby leakage. A 16x16-bit 2’s complement static wireless baseband multiplier testchip is described in a 90nm dual-Vt 7-metal CMOS process [1]. 1GHz operation is achieved at 1.2V, scaling to 500MHz at 0.8V. One-hot Booth encoding, a sum/carry output delay-difference optimized 3:2 compressor tree, and signal-profile optimized final adder schemes are proposed to achieve low wiring complexity and ultra low power dissipation of 22mW/3mW at 1.2V/0.8V. A power-cutoff sleep transistor on the Vcc rail enables standby mode leakage reduction. Single-rail wiring is used throughout, resulting in a dense layout occupying 215μm x 130μm (Fig. 9).

Multiplier Organization

Fig. 1 shows the multiplier organization, which is partitioned into 3 stages. The first stage generates 2’s complement partial products, which is conventionally performed using Booth encoding with sign extension. However, sign extension requires 30% extra transistors along the boundary of the partial product tree and longer wire-loading on the Booth multiplexors, resulting in an uneven layout topology (Fig. 2(a)) [2]. Fig. 2(b) shows the optimized one-hot Booth encoding scheme and selection table, which reduces the partial product selection circuit to a single 6:1 multiplexor with minimized contention (Fig. 3).
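As an illustration of how Booth recoding generates 2’s complement partial products, the following is a minimal software sketch. It assumes textbook radix-4 (modified) Booth recoding; the one-hot encoding, selection table, and 6:1 multiplexor of Fig. 2(b) and Fig. 3 are not reproduced here, and the function names (`to_signed16`, `booth_radix4_digits`, `booth_multiply`) are hypothetical.

```python
def to_signed16(v):
    """Interpret the low 16 bits of v as a 2's complement value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def booth_radix4_digits(y):
    """Recode a 16-bit 2's complement multiplier into 8 digits in {-2,...,+2}.

    Digit i is y[2i] + y[2i-1] - 2*y[2i+1], with y[-1] taken as 0.
    This is the textbook recoding, assumed here for illustration only.
    """
    bits = y & 0xFFFF
    digits = []
    prev = 0  # implicit bit to the right of bit 0
    for i in range(8):
        b0 = (bits >> (2 * i)) & 1
        b1 = (bits >> (2 * i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y):
    """Multiply two 16-bit 2's complement values via 8 Booth partial products."""
    x = to_signed16(x)
    digits = booth_radix4_digits(y)
    # Each partial product is one of 0, +x, -x, +2x, -2x, weighted by 4**i.
    partial_products = [d * x * (1 << (2 * i)) for i, d in enumerate(digits)]
    return sum(partial_products)
```

Each recoded digit selects one multiple of the multiplicand, so a 16-bit multiplier yields 8 partial products instead of 16. Hardware handles the negative multiples with inversion and a correction bit rather than Python's signed integers, which this sketch leaves out.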
Total partial product bits are reduced from 208 to 160 (23% reduction), resulting in 15% total energy (switching + active leakage) savings.

In the second stage, a partial-product reduction tree (PPRT) compresses the Booth-encoded partial products via 3:2 compressors to produce two 32-bit carry-save format inputs for the final output adder. To improve performance, the delay difference between the sum and carry outputs of the static 3:2 compressor is exploited to reoptimize the PPRT to minimize total horizontal and vertical tree propagation delays [3]. Fig. 4(a) shows the 3:2 compressor and sum vs. carry output worst-case 90nm delay comparisons at 0.8V, 110°C. The fast-arriving carry_out outputs of each PPRT stage (31% earlier arrival time) are connected to the slow upper-stack (A, B) inputs of the next PPRT stage (Fig. 4(b)), resulting in an overall 8% reduction in total PPRT critical-path delay compared to the conventional Wallace-tree approach.

The third stage adds the two PPRT outputs via a fast 32-bit carry-propagate adder to produce the final 32-bit result. Conventional implementations use the fastest binary adder scheme, assuming simultaneous arrival of all PPRT outputs. Fig. 5 shows the actual PPRT signal arrival profile, with an 8x difference between the earliest and latest arriving final adder inputs. To exploit this delay profile, a multi-architecture signal-profile optimized final addition scheme is used: a low area/power consuming ripple carry adder (RCA) topology for the early arriving sections, while a variable block adder (VBA) [4] and a conditional sum adder are used for the late arriving sections. Arrival of PPRT output bits <7:0> is less than the delay of an 8-bit RCA, and thus an optimized RCA with fast carry-propagate delays (Fig. 6) is used for these bits. Bits <23:8> arrive at approximately the same time, and a 16-bit VBA that achieves the performance of a 16-bit carry lookahead scheme is used.
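The second-stage reduction and the segmented final addition can be sketched together as a functional model. This is a sketch under stated assumptions: it models only the arithmetic, not the delay-aware compressor wiring or the RCA/VBA circuits; the remaining bits <31:24> are handled with a conditional-sum style select; and the function names are hypothetical.

```python
MASK32 = (1 << 32) - 1


def compress_3_2(a, b, c):
    """Bitwise 3:2 compressor: one full adder per bit column of three words."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK32, carry & MASK32


def reduce_to_carry_save(rows):
    """Reduce 32-bit partial-product rows to a (sum, carry) pair.

    Rows are grouped in threes layer by layer; the paper's arrival-time
    driven input assignment is NOT modelled here.
    """
    rows = [r & MASK32 for r in rows]
    while len(rows) < 2:
        rows.append(0)
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(compress_3_2(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[full:])  # leftover rows pass to the next layer
        rows = nxt
    return rows[0], rows[1]


def segmented_add(s, c):
    """Add the carry-save pair in three segments: <7:0>, <23:8>, <31:24>."""
    low = (s & 0xFF) + (c & 0xFF)                       # bits <7:0>
    carry8 = low >> 8
    mid = ((s >> 8) & 0xFFFF) + ((c >> 8) & 0xFFFF) + carry8  # bits <23:8>
    carry24 = mid >> 16
    hi_s, hi_c = (s >> 24) & 0xFF, (c >> 24) & 0xFF
    hi_if_0 = (hi_s + hi_c) & 0xFF                      # assumes carry-in 0
    hi_if_1 = (hi_s + hi_c + 1) & 0xFF                  # assumes carry-in 1
    hi = hi_if_1 if carry24 else hi_if_0                # late select
    return (hi << 24) | ((mid & 0xFFFF) << 8) | (low & 0xFF)
```

Computing both candidate results for the top 8 bits ahead of time means that, once the carry out of bit 23 is known, only a select remains for those bits.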
Upper-order bits <31:24>, which also arrive early, are computed using a conditional sum adder based on two 8-bit RCAs. This results in a single pass-gate mux delay contribution to the critical path for the final 8 bits, further improving critical-path performance. Total final adder energy reduction due to the proposed scheme is 20%.

Write-port based latches are used at the input and output clock boundaries [5], resulting in 36% lower latch energy and a 14% reduction in total clock energy (including clock drivers). Latch transistors use minimum permissible device sizes and output drivers are gained-up to drive the large Booth encoder and select-mux loads. The multiplier layout is surrounded by a 200μm (equivalent device width) power-switch sleep transistor on the Vcc rail (Fig. 9) to enable standby leakage reduction. This is especially important in the context of wireless DSP tasks to reduce leakage of unused multipliers in large filter banks.

90nm Energy-Delay Results and Scaling Performance

The complete multiplier operates at 1GHz (1.2V, 110°C simulation), and consumes 22mW total power with worst-case input vector conditions. Active leakage power consumption is 778μW. Table 1 shows the energy benefit achieved by each scheme, resulting in a cumulative 12% total energy reduction for the proposed multiplier. To reduce total active leakage and switching power, minimum permissible device sizes are used throughout this implementation, with selective upsizing only on the Booth encoder and final adder critical-path devices. Since the energy contribution of the Booth encoder and final adder is 23% of total worst-case energy (Fig. 7), this selective upsizing resulted in a good energy-delay trade-off: a total critical-path delay improvement of 41% with only an 8% penalty in total energy. Fig. 8 shows the energy-delay behavior of the multiplier with lowering Vcc. At 0.8V, the multiplier operates at 500MHz with 3mW total worst-case power and 192μW active leakage power.
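The two operating points above can be compared as energy per multiplication. The following minimal sketch assumes one multiplication completes per clock cycle (single-cycle throughput, as stated in the introduction) and uses the rounded power figures quoted in the text; the function name `energy_per_op_pj` is hypothetical.

```python
def energy_per_op_pj(power_mw, freq_mhz):
    """Energy per operation in pJ, assuming one operation per clock cycle.

    power [mW] / frequency [MHz] gives nJ per cycle; the factor 1000
    converts nJ to pJ.
    """
    return power_mw * 1000.0 / freq_mhz


# Operating points quoted in the text (worst-case total power).
e_1v2 = energy_per_op_pj(22.0, 1000.0)  # 1.2V, 1GHz
e_0v8 = energy_per_op_pj(3.0, 500.0)    # 0.8V, 500MHz
```

Under these assumptions the multiplier spends 22 pJ per multiplication at 1.2V and 6 pJ at 0.8V, so halving the clock rate at the lower supply costs roughly 3.7x less energy per operation.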
The dense Booth encoding, PPRT, and final adder schemes result in low average transistor sizes and consequently low active leakage energy component (≤6%), minimizing the impact of higher leakage in future technologies. Further, the associated decrease in Booth encoder and PPRT interconnect reduces the effect of increased wire delay in future technologies. In a 65nm technology, where device leakage is expected to increase by 3-5x [6], we project 28% delay improvement and 50% energy reduction, with a low (10%) active leakage energy component.

Conclusion

A 90nm wireless baseband 16x16-bit 2’s complement multiplier testchip is described that achieves 1GHz, 22mW operation at 1.2V, scaling to 500MHz, 3mW at 0.8V.

Acknowledgement

The authors thank S. Pawlowski, K. Soumyanath, T. Chun, L. Snyder, E. Tsui, G. Gerosa for discussions; and J. Rattner, M. Haycock for encouragement and support.

References

[1] S. Thompson et al., 2002 IEDM Tech. Digest, pp. 61-64.
[2] N. Itoh et al., 1999 VLSI Circuits Symp. Digest, pp. 15-16.
[3] V. Oklobdzija et al., IEEE Trans. Computers, March 1996, pp. 294-306.
[4] C. Martel et al., IEEE Trans. Computers, March 1998, pp. 273-285.
[5] R. Krishnamurthy et al., 2002 VLSI Circuits Symp. Digest, pp. 128-129.
[6] T. Ghani et al., 2000 VLSI Tech. Symp., pp. 174-175.
Publication date: 2003